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Abstract 

Background:  Molecular-docking-based  virtual  screening  is  an  important  tool  in  drug  discovery 
that  is  used  to  significantly  reduce  the  number  of  possible  chemical  compounds  to  be  investigated. 
In  addition  to  the  selection  of  a  sound  docking  strategy  with  appropriate  scoring  functions,  another 
technical  challenge  is  to  in  silico  screen  millions  of  compounds  in  a  reasonable  time.  To  meet  this 
challenge,  it  is  necessary  to  use  high  performance  computing  (HPC)  platforms  and  techniques. 
However,  the  development  of  an  integrated  HPC  system  that  makes  efficient  use  of  its  elements  is 
not  trivial. 

Results:  We  have  developed  an  application  termed  DOVIS  that  uses  AutoDock  (version  3)  as  the 
docking  engine  and  runs  in  parallel  on  a  Linux  cluster.  DOVIS  can  efficiently  dock  large  numbers 
(millions)  of  small  molecules  (ligands)  to  a  receptor,  screening  500  to  1,000  compounds  per 
processor  per  day.  Furthermore,  in  DOVIS,  the  docking  session  is  fully  integrated  and  automated 
in  that  the  inputs  are  specified  via  a  graphical  user  interface,  the  calculations  are  fully  integrated 
with  a  Linux  cluster  queuing  system  for  parallel  processing,  and  the  results  can  be  visualized  and 
queried. 

Conclusion:  DOVIS  removes  most  of  the  complexities  and  organizational  problems  associated 
with  large-scale  high-throughput  virtual  screening,  and  provides  a  convenient  and  efficient  solution 
for  AutoDock  users  to  use  this  software  in  a  Linux  cluster  platform. 


Background 

In  the  last  several  years,  virtual  screening  has  become  an 
accepted  tool  in  drug  discovery.  It  has  been  successfully 
applied  in  a  number  of  therapeutic  programs,  in  particu¬ 
lar,  at  the  lead  discovery  stage,  where  high-throughput 
molecular  docking  can  play  an  important  role  [1].  In  con¬ 
cert  with  the  continued  need  for  improvements  of  in  silico 


docking  accuracies,  the  explosive  growth  of  commercial 
and  publicly  available  chemical  databases  requires  com¬ 
putational  techniques  to  efficiently  implement  docking 
protocols  and  rapidly  screen  millions  of  compounds  in  a 
timely  fashion.  Here,  we  are  focusing  on  the  techniques  to 
enable  large  scale  docking  using  Linux-based  HPC  plat¬ 
forms. 
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Several  commercial  docking  programs,  such  as  Glide  [2], 
LigandFit  [3]  and  FlexX  [4],  can  distribute  docking  jobs  to 
computers  over  the  network.  However,  protocols  that  can 
seamlessly  dock  millions  of  compounds  and  capture  the 
top  percentage  of  high-scoring  ligands  are  not  standard. 
There  are  two  major  requirements  for  such  a  protocol  run¬ 
ning  on  a  Linux  cluster:  (1)  the  ability  to  launch  parallel 
docking  jobs  through  a  queuing  system;  and  (2)  the  abil¬ 
ity  to  process  millions  of  compounds  in  a  reasonable 
time.  Since  the  latter  requirement  may  call  for  hundreds 
of  central  processing  units  (CPUs)  working  simultane¬ 
ously,  the  protocol  should  effectively  handle  the  associ¬ 
ated  data  flow  through  the  file  system  without  affecting 
the  performance  of  the  cluster.  In  this  note,  we  describe  a 
Linux  cluster-based  protocol  using  AutoDock  [5]  as  the 
docking  engine. 

AutoDock  is  a  widely  used  docking  program  developed  at 
the  Scripps  Research  Institute.  Application  of  AutoDock 
requires  several  separate  pre-docking  steps,  e.g.,  ligand 
preparation,  receptor  preparation,  and  grid  map  calcula¬ 
tions,  before  the  actual  docking  process  can  take  place. 
Existing  tools,  such  as  AutoDock  Tools  (ADT)  [6]  and  BDT 
[7],  integrate  individual  AutoDock  steps  within  a  graphi¬ 
cal  user  interface  (GUI),  and  provide  automated  features 
for  docking  runs.  However,  they  do  not  contain  the  capa¬ 
bility  to  effectively  process  millions  of  compounds  in  a 
single  execution.  At  the  AutoDock  Website  [8],  there  is  a 
tutorial  with  scripts  to  teach  users  how  to  use  AutoDock 
and  how  to  employ  UNIX  commands  to  perform  virtual 
screening.  In  the  tutorial,  first  a  user  needs  to  manually 
prepare  the  receptor,  ligands,  energy  grids  and  a  list  of  lig¬ 
ands  names;  then  every  compound  in  the  list  is  looped 
through  by  a  shell  script,  where  the  compound  can  be 
docked  either  by  executing  AutoDock  commands  inside 
the  loop  or  by  submitting  the  corresponding  commands 
to  a  queuing  system.  This  approach  works  well  for  thou¬ 
sands  of  compounds,  however,  it  may  not  work  for  mil¬ 
lions  of  compounds  due  to  the  limitations  of  the  file  and 
queuing  systems.  Concurrently,  AutoDock  has  also  been 
used  in  grid-computing  projects,  FightAIDS@Home  [9] 
and  WISDOM  [10].  In  a  grid-computing  infrastructure, 
tens  to  hundreds  of  thousand  CPUs  are  used  to  setup  a 
computational  grid.  Through  specialized  middleware, 
servers  in  a  grid  can  schedule  jobs,  send  applications  and 
data  to  a  CPU,  and  retrieve  results.  Because  of  the  large 
number  of  CPUs  involved,  the  computational  power  of  a 
grid  is  phenomenal.  For  example,  in  the  WISDOM 
project,  10  million  docking  experiments  were  performed 
using  AutoDock  in  20  days.  However,  due  to  the  nature  of 
its  infrastructure,  the  job  success  rate  was  65%  and  the 
percentage  of  time  spent  running  the  application  was 
~50%.  On  the  other  hand,  using  a  dedicated  HPC  plat¬ 
form  (e.g.,  a  128-node  Linux  cluster)  with  specialized  soft¬ 
ware  tools  for  AutoDock,  it  is  possible  to  screen  millions 


of  compounds  in  a  reasonable  time  at  a  much  higher  job 
success  rate  and  CPU  efficiency. 

Here,  we  describe  a  DOcking-based  Virtual  Screening 
(DOVIS)  pipeline,  based  on  AutoDock  (version  3),  where 
Perl  and  shell  scripts  are  used  to  integrate  executables  and 
scripts  from  ADT,  OpenBabel  [11]  and  AutoDock.  This 
implementation  has  the  following  advantages:  (1)  a  scal¬ 
able  parallelization  scheme  for  AutoDock  integrated  with 
queuing  systems,  such  as  the  Load  Sharing  Facility  (LSF), 
Platform  Computing  Inc.  (Ontario,  Canada)  and  the  Port¬ 
able  Batch  System  (PBS),  Altair  Grid  Technologies  (Troy, 
MI);  (2)  a  protocol  to  retain  user-specified  top  percentage 
of  docked  ligands  based  on  their  docking  scores;  (3)  a  X- 
window's-based  GUI  for  users  to  specify  docking  parame¬ 
ters,  submit  docking  jobs  and  query/visualize  docked  lig¬ 
ands;  and  (4)  a  collection  of  pre-processed  compounds 
from  the  ZINC  database  [12]  and  the  National  Cancer 
Institute  (NCI)  diversity  dataset  [13]  in  native  AutoDock 
pdbq  format. 

Implementation 

A  high-throughput  screening  campaign  typically  has 
many  more  ligands  (millions)  than  the  number  of  CPUs 
(hundreds)  available.  In  this  case,  a  straightforward 
approach  to  process  the  ligands  is  via  a  parallelization 
scheme  where  the  input  ligands  (or  database)  are  divided 
into  N  equal  partitions  corresponding  to  the  number  of 
CPUs  available  for  docking.  For  each  CPU,  a  single  parti¬ 
tion  of  ligands  is  docked  to  the  target  receptor.  After  all 
docking  jobs  are  completed,  the  results  from  the  N  CPUs 
are  consolidated  and  the  final  (user  specified)  top  M  scor¬ 
ing  ligands  are  returned  to  the  user  for  further  analysis. 

The  input  parameters  for  screening  are  specified  via  a  GUI, 
which  creates  one  master  parameter  file  that  is  used  to 
drive  all  related  scripts  and  programs.  Inside  DOVIS  there 
are  three  distinct  implementation  steps,  pre-docking,  par¬ 
allel  docking,  and  post-docking,  which  are  integrated  with 
a  queuing  system. 

Pre-docking 

In  this  step,  the  receptor  is  converted  to  the  native  Auto¬ 
Dock  format  and  the  ligands  are  partitioned  into  N  files. 
DOVIS  accepts  receptors  with  all  hydrogen  atoms  speci¬ 
fied  in  standard  pdb  or  mol2  format.  A  Python  script, 
"prepare_receptor.py"  from  ADT,  is  used  to  convert  the 
receptor  into  the  pdbqs  format  required  by  AutoDock. 
The  acceptable  input  formats  for  ligands  include  sd,  mol2, 
and  pdbq.  When  the  ligand  is  provided  in  the  sd  format, 
it  is  converted  into  mol2  by  OpenBabel.  The  mol2  files  are 
then  partitioned  into  N  files  with  roughly  the  same 
number  of  molecules  in  each  file.  If  the  input  ligands  are 
provided  as  pre-processed  pdbq  files,  the  list  of  ligand  file 
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names  is  simply  partitioned  into  N  lists  of  roughly  equal 
length. 

Parallel  Docking 

From  the  previous  step,  either  a  ligand  file  (containing 
multiple  ligands)  or  a  list  of  ligands  is  assigned  to  each 
CPU,  where  the  ligands  in  each  set  are  separately  proc¬ 
essed  and  docked  to  the  receptor,  one  ligand  at  a  time  per 
CPU.  For  ligands  in  the  mol2  format,  a  Python  script, 
"prepare_ligand.py"  from  ADT,  is  used  to  generate  the 
corresponding  pdbq  file.  We  pre-compute  energy  grid 
maps  for  all  possible  ligand  atom  types  at  the  beginning 
of  each  parallel  job.  Thus,  any  grid  map  needed  by  Auto- 
Dock  can  be  directly  loaded  without  a  separate  calcula¬ 
tion.  After  each  docking,  the  docking  log-file  (dig)  is 
parsed  by  a  Perl  script  to  extract  the  "estimated  free  energy 
of  binding, "  which  is  used  as  the  criterion  to  select  the  top 
M  scoring  dig  files  from  each  CPU. 

During  testing,  we  found  that  there  are  more  than  10  file 
operations  and  over  10  Mb  of  data  flow  associated  with 
the  docking  of  each  ligand.  If  a  Network  File  System  (NFS) 
is  used  with  more  than  100  CPUs  running  concurrently, 
the  data  flow  degrades  the  performance  of  the  entire  Linux 
cluster.  To  overcome  this  problem,  DOVIS  provides  the 
option  to  use  the  local  disk  drive  on  each  node  or  CPU  to 
store  the  energy  grid  maps  and  other  intermediate  files. 

Post-docking 

After  all  docking  jobs  are  completed,  a  Perl  script  is  used 
to  combine  all  saved  results  from  each  CPU  and  to  collect 
the  top  M  scoring  ligands  from  the  consolidated  list  as  the 
final  result.  The  dig  files  of  all  selected  top-scoring  ligands 
are  collected  and  compressed  into  a  directory. 

Queue  Integration 

On  large  Linux  clusters  with  distributed  memory,  it  is 
essential  to  integrate  DOVIS  with  a  queuing  system.  With 
the  job  dependency  support  of  a  queuing  system,  the 
three  steps  discussed  above  can  be  sequentially  executed 
and  parallel  jobs  can  be  automatically  launched.  For  the 
parallel  docking  job,  at  each  CPU,  the  same  executables 
are  used  to  process  different  ligands.  This  computational 
strategy  is  most  efficiently  handled  by  queuing  systems 
that  support  the  functionality  to  bundle  many  single¬ 
processor  jobs  using  the  same  executables  into  a  single 
job,  typically  referred  to  as  Job  Array.  Thus,  DOVIS  is  suit¬ 
able  for  any  queuing  system  that  supports  Job  Array  and 
job  dependency.  Currently,  we  have  integrated  DOVIS 
with  the  LSF  and  the  PBS  queuing  systems. 

In  order  to  run  DOVIS  on  Linux  clusters  without  a  queu¬ 
ing  system,  we  provide  a  version  of  DOVIS  that  uses 
multi-threading  to  run  parallel  docking  jobs.  This  scheme 
is  especially  suitable  for  shared-memory  Linux  clusters. 


There  is  another  issue  related  to  shared  queuing  systems. 
Usually,  every  job  in  the  queue  has  a  runtime  limit.  Once 
the  limit  is  reached,  the  job  is  terminated.  When  docking 
large  number  of  ligands,  Job  Array  may  not  be  able  to 
complete  all  tasks  within  the  allowed  runtime  limit. 
Therefore,  we  implemented  a  restart  function,  which 
tracks  the  progress  of  each  CPU  and  gives  the  user  the 
option  to  manually  restart  DOVIS  from  where  it  left  off. 

Graphic  User  Interface 

A  GUI  was  developed  for  DOVIS  using  Java  Swing  to  pro¬ 
vide  a  convenient  way  to  specify  the  target  receptor,  the 
ligand  database  and  the  docking  parameters.  Using  the 
GUI,  users  are  also  able  to  submit  the  job  to  the  queuing 
system.  In  addition,  the  JMol  [14]  molecular  viewer  is 
embedded  in  the  GUI  to  enable  visualization  of  docked 
ligands  with  the  target  receptor. 

Results 

In  order  to  build  a  ligand  database  for  testing  and  scien¬ 
tific  research  purpose,  we  pre-processed  the  ZINC  data¬ 
base  (version  5,  2.07  million  compounds)  and  the  NCI 
diversity  dataset  (1,990  compounds)  into  AutoDock 
pdbq  format.  Using  the  pre-processed  ZINC  database,  we 
tested  how  many  ligands  DOVIS  could  dock  to  a  receptor 
per  day  with  varying  numbers  of  CPUs.  We  performed 
tests  with  up  to  128  CPUs  on  a  Linux  cluster  and  observed 
a  near-linear  speedup  as  a  function  of  number  of  CPUs, 
for  the  ZINC  database.  This  indicates  that  our  implemen¬ 
tation  achieves  near-optimal  performance.  In  addition, 
we  carried  out  the  virtual  screening  of  the  ZINC  database 
against  the  ricin  A  chain  (267  amino  acids)  as  a  receptor 
target  with  256  CPUs.  The  task  was  completed  in  12  days, 
corresponding  to  ~700  ligands  per  CPU  per  day,  with  two 
manual  restarts  due  to  a  four-day  runtime  limit  in  our 
queuing  system.  See  additional  file  1  for  the  detailed 
parameter  choices. 

Discussion 

We  developed  DOVIS  as  a  utility  software  to  automate 
docking  jobs  with  AutoDock.  It  can  reliably  screen  mil¬ 
lions  of  compounds  against  a  receptor  and  automatically 
save  the  top  percentage  of  high-scoring  hits.  The  paralleli¬ 
zation  scheme  employed  here  provides  a  straightforward 
approach  to  such  a  problem.  When  all  CPUs  are  started 
around  the  same  time,  this  scheme  works  well  and  all 
CPUs  shall  complete  their  jobs  at  about  the  same  time. 
However,  when  the  queuing  system  does  not  allow  all 
CPUs  to  be  launched  at  the  same  time,  with  the  equal 
number  of  ligands  assigned  to  each  CPU,  the  efficiency  of 
DOVIS  is  compromised.  This  may  become  a  problem  with 
a  queuing  system  when  a  large  number  of  CPUs  are 
requested  for  a  calculation.  To  solve  this  problem,  the 
number  of  ligands  assigned  to  each  CPU  should  be  re-dis- 
tributed  in  a  way  that  the  CPUs  launched  earlier  shall 


Page  3  of  4 

(page  number  not  for  citation  purposes) 


BMC  Bioinformatics  2008,  9:126 


http://www.biomedcentral.eom/1 471-21 05/9/1 26 


docked  more  ligands  than  the  CPUs  launched  later.  Cur¬ 
rently,  we  are  working  on  an  improvement  to  address  this 
issue. 

Conclusion 

Using  AutoDock  as  its  docking  engine,  DOVIS  provides 
an  automated  parallel  docking  package  that  is  integrated 
with  a  queuing  system.  This  application  is  suitable  for 
conducting  large-scale  high-throughput  virtual  screening 
on  Linux  cluster  platforms.  The  DOVIS  package  is  freely 
available  (see  additional  file  1  and  2  for  detailed  informa¬ 
tion). 
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